{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Epsilon-greedy\n",
    "\n",
    "We already learned about the epsilon-greedy algorithm in the previous chapters. With epsilon-greedy, we select the best arm with probability 1-epsilon and a random arm with probability epsilon. Let's take a simple example and see in more detail how the epsilon-greedy method finds the best arm.\n",
    "\n",
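    "Written compactly, and purely as a restatement of the rule above, the arm $a_t$ pulled in round $t$ can be sketched as follows. The symbols $a_t$ and $Q(a)$ (the current average reward of arm $a$) are notation assumed here for illustration, as is the usual convention that the random arm is drawn uniformly from all arms:\n",
    "\n",
    "$$\n",
    "a_t = \\begin{cases} \\arg\\max_a Q(a) & \\text{with probability } 1-\\epsilon \\\\ \\text{an arm drawn uniformly at random} & \\text{with probability } \\epsilon \\end{cases}\n",
    "$$\n",
    "\n",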
    "Say we have two arms: arm 1 and arm 2. Suppose that with arm 1 we win the game 80% of the time and with arm 2 we win the game 20% of the time. So, arm 1 is the best arm, since it has the higher probability of winning. Now, let's learn how to find this with the epsilon-greedy method.\n",
    "\n",
    "First, we initialize, for each arm, `count` (the number of times the arm is pulled), `sum_rewards` (the sum of the rewards obtained from pulling the arm), and `Q` (the average reward obtained by pulling the arm), as shown below:\n",
    "\n",
    "\n",
    "![title](Images/1.PNG)\n",
    "\n",
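    "In formula form, a minimal sketch of how the table entries relate: writing $N(a)$ for the `count` of arm $a$ and $S(a)$ for its `sum_rewards` (both symbols are shorthand assumed here, not notation from the chapter), the average reward is\n",
    "\n",
    "$$\n",
    "Q(a) = \\frac{S(a)}{N(a)} \\quad \\text{for } N(a) > 0\n",
    "$$\n",
    "\n",
    "An arm that has not been pulled yet has no average to compute; treating its `Q` as 0 until the first pull is an assumption made for this sketch.\n",
    "\n",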
    "\n",
    "\n",
    "## Round 1:\n",
    "\n",
    "Say, in round 1 of the game, we select a random arm with probability epsilon. Suppose we randomly pull arm 1 and observe a reward of 1. So, we update our table by setting the `count` of arm 1 to 1 and the `sum_rewards` of arm 1 to 1, and thus the average reward `Q` of arm 1 after round 1 is 1, as shown below:\n",
    "\n",
    "\n",
    "![title](Images/2.PNG)\n",
    "\n",
    "\n",
    "## Round 2:\n",
    "\n",
    "Say, in round 2, we select the best arm with probability 1-epsilon. The best arm is the one with the maximum average reward. We check our table to see which arm has the maximum average reward. Since arm 1 has the maximum average reward, we pull arm 1, and suppose the reward obtained is 1. So, we update our table by setting the `count` of arm 1 to 2 and the `sum_rewards` of arm 1 to 2, and thus the average reward `Q` of arm 1 after round 2 is 2/2 = 1, as shown below:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/3.PNG)\n",
    "\n",
    "## Round 3:\n",
    "\n",
    "Say, in round 3, we select a random arm with probability epsilon. Suppose we randomly pull arm 2 and observe a reward of 0. So, we update our table by setting the `count` of arm 2 to 1 and the `sum_rewards` of arm 2 to 0, and thus the average reward `Q` of arm 2 after round 3 is 0, as shown below:\n",
    "\n",
    "![title](Images/4.PNG)\n",
    "\n",
    "## Round 4:\n",
    "\n",
    "Say, in round 4, we select the best arm with probability 1-epsilon. So, we pull arm 1, since it has the maximum average reward. Let the reward obtained by pulling arm 1 be 0 this time. Now, we update our table by setting the `count` of arm 1 to 3 and keeping the `sum_rewards` of arm 1 at 2, and thus the average reward `Q` of arm 1 after round 4 is 2/3, or approximately 0.66, as shown below:\n",
    "\n",
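    "As a quick check of the arithmetic, using the rewards observed for arm 1 in rounds 1, 2, and 4 (the symbol $Q(1)$ for the average reward of arm 1 is shorthand assumed here):\n",
    "\n",
    "$$\n",
    "Q(1) = \\frac{1 + 1 + 0}{3} = \\frac{2}{3} \\approx 0.667\n",
    "$$\n",
    "\n",
    "This is the value quoted as 0.66 in the text, truncated to two decimals. Arm 2 stays at $Q(2) = 0/1 = 0$, so arm 1 still has the maximum average reward.\n",
    "\n",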
    "\n",
    "\n",
    "![title](Images/5.PNG)\n",
    "\n",
    "We repeat this process for many rounds: in each round of the game, we pull the best arm with probability 1-epsilon and a random arm with probability epsilon. The updated table after about 100 rounds of the game is shown below:\n",
    "\n",
    "![title](Images/6.PNG)\n",
    "\n",
    "\n",
    "From the above table, we can conclude that arm 1 is the best arm, since it has the maximum average reward `Q`.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
